{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training and Deploying the Fraud Detection Model\n",
    "\n",
    "In this notebook, we will take the outputs from the Processing Job in the previous step and use it and train and deploy an XGBoost model. Our historic transaction dataset is initially comprised of data like timestamp, card number, and transaction amount and we enriched each transaction with features about that card number's recent history, including:\n",
    "\n",
    "- `num_trans_last_10m`\n",
    "- `num_trans_last_1w`\n",
    "- `avg_amt_last_10m`\n",
    "- `avg_amt_last_1w`\n",
    "\n",
    "Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sagemaker.inputs import TrainingInput\n",
    "from sagemaker.session import Session\n",
    "from sagemaker import image_uris\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import sagemaker\n",
    "import boto3\n",
    "import io"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Essentials "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LOCAL_DIR = './data'\n",
    "BUCKET = sagemaker.Session().default_bucket()\n",
    "PREFIX = 'training'\n",
    "\n",
    "sagemaker_role = sagemaker.get_execution_role()\n",
    "s3_client = boto3.Session().client('s3')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's load the results of the SageMaker Processing Job ran in the previous step into a Pandas dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(f'{LOCAL_DIR}/aggregated/processing_output.csv')\n",
    "#df.dropna(inplace=True)\n",
    "df['cc_num'] = df['cc_num'].astype(np.int64)\n",
    "df['fraud_label'] = df['fraud_label'].astype(np.int64)\n",
    "df.head()\n",
    "len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split DataFrame into Train & Test Sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The artifically generated dataset contains transactions from `2020-01-01` to `2020-06-01`. We will create a training and validation set out of transactions from `2020-01-15` and `2020-05-15`, discarding the first two weeks in order for our aggregated features to have built up sufficient history for cards and leaving the last two weeks as a holdout test set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_start = '2020-01-15'\n",
    "training_end = '2020-05-15'\n",
    "\n",
    "training_df = df[(df.datetime > training_start) & (df.datetime < training_end)]\n",
    "test_df = df[df.datetime >= training_end]\n",
    "\n",
    "test_df.to_csv(f'{LOCAL_DIR}/test.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although we now have lots of information about each transaction in our training dataset, we don't want to pass everything as features to the XGBoost algorithm for training because some elements are not useful for detecting fraud or creating a performant model:\n",
    "- A transaction ID and timestamp is unique to the transaction and never seen again. \n",
    "- A card number, if included in the feature set at all, should be a categorical variable. But we don't want our model to learn that specific card numbers are associated with fraud as this might lead to our system blocking genuine behaviour. Instead we should only have the model learn to detect shifting patterns in a card's spending history. \n",
    "- Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. \n",
    "\n",
    "Given all of the above, we drop all columns except for the normalised ratio features and transaction amount from our training dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_df.drop(['tid','datetime','cc_num','num_trans_last_10m', 'avg_amt_last_10m',\n",
    "       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) requires the label to be the first column in the training data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_df = training_df[['fraud_label', 'amount', 'amt_ratio1','amt_ratio2','count_ratio']]\n",
    "training_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, val = train_test_split(training_df, test_size=0.3)\n",
    "train.to_csv(f'{LOCAL_DIR}/train.csv', header=False, index=False)\n",
    "val.to_csv(f'{LOCAL_DIR}/val.csv', header=False, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp {LOCAL_DIR}/train.csv s3://{BUCKET}/{PREFIX}/\n",
    "!aws s3 cp {LOCAL_DIR}/val.csv s3://{BUCKET}/{PREFIX}/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize hyperparameters\n",
    "hyperparameters = {\n",
    "        \"max_depth\":\"5\",\n",
    "        \"eta\":\"0.2\",\n",
    "        \"gamma\":\"4\",\n",
    "        \"min_child_weight\":\"6\",\n",
    "        \"subsample\":\"0.7\",\n",
    "        \"objective\":\"binary:logistic\",\n",
    "        \"num_round\":\"100\"}\n",
    "\n",
    "output_path = 's3://{}/{}/output'.format(BUCKET, PREFIX)\n",
    "\n",
    "# this line automatically looks for the XGBoost image URI and builds an XGBoost container.\n",
    "# specify the repo_version depending on your preference.\n",
    "xgboost_container = sagemaker.image_uris.retrieve(\"xgboost\", sagemaker.Session().boto_region_name, \"1.2-1\")\n",
    "\n",
    "# construct a SageMaker estimator that calls the xgboost-container\n",
    "estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, \n",
    "                                          hyperparameters=hyperparameters,\n",
    "                                          role=sagemaker.get_execution_role(),\n",
    "                                          instance_count=1, \n",
    "                                          instance_type='ml.m5.2xlarge', \n",
    "                                          volume_size=5, # 5 GB \n",
    "                                          output_path=output_path)\n",
    "\n",
    "# define the data type and paths to the training and validation datasets\n",
    "content_type = \"csv\"\n",
    "train_input = TrainingInput(\"s3://{}/{}/{}\".format(BUCKET, PREFIX, 'train.csv'), content_type=content_type)\n",
    "validation_input = TrainingInput(\"s3://{}/{}/{}\".format(BUCKET, PREFIX, 'val.csv'), content_type=content_type)\n",
    "\n",
    "# execute the XGBoost training job\n",
    "estimator.fit({'train': train_input, 'validation': validation_input})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ideally we would perform hyperparameter tuning before deployment, but for the purposes of this example will deploy the model that resulted from the Training Job directly to a SageMaker hosted endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = estimator.deploy(\n",
    "    initial_instance_count=1, \n",
    "    instance_type='ml.t2.medium',\n",
    "    serializer=sagemaker.serializers.CSVSerializer(), wait=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoint_name=predictor.endpoint_name\n",
    "#Store the endpoint name for later cleanup \n",
    "%store endpoint_name\n",
    "endpoint_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now to check that our endpoint is working, let's call it directly with a record from our test hold-out set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "payload_df = test_df.drop(['tid','datetime','cc_num','fraud_label','num_trans_last_10m', 'avg_amt_last_10m',\n",
    "       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1)\n",
    "payload = payload_df.head(1).to_csv(index=False, header=False).strip()\n",
    "payload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "float(predictor.predict(payload).decode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show that the model predicts FRAUD / NOT FRAUD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "count_ratio = 0.30\n",
    "payload = f'1.00,1.0,1.0,{count_ratio:.2f}'\n",
    "is_fraud = float(predictor.predict(payload).decode('utf-8'))\n",
    "print(f'With transaction count ratio of: {count_ratio:.2f}, fraud score: {is_fraud:.3f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "count_ratio = 0.06\n",
    "payload = f'1.00,1.0,1.0,{count_ratio:.2f}'\n",
    "is_fraud = float(predictor.predict(payload).decode('utf-8'))\n",
    "print(f'With transaction count ratio of: {count_ratio:.2f}, fraud score: {is_fraud:.3f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
